This paper presents the design and
implementation of trace-directed program
restructuring (TDPR) for AIX® executable
programs. TDPR is the process of reordering
the instructions in an executable program,
using an actual execution profile (or
instruction address trace) for a selected
workload, to improve utilization of the
existing hardware architecture. Generally, the
application of TDPR results in faster programs,
programs that use less real memory, or both.
Previous similar work [1-6] regarding
profile-guided or feedback-directed program
optimization has demonstrated significant
improvements for various architectures. TDPR
applies these concepts to AIX executable
programs at a global level (i.e., independent
of procedure or other structural boundaries)
running on the POWER, POWER2™, and
PowerPC 601™ machines and adds the
methodology to preserve correctness and
debuggability for reordered executables. Using
the prototype tools developed for this effort
on a selection of both user-level application
programs and operating system (kernel) code,
improvements in execution time of up to 73%
and reduced instruction memory requirements
of up to 61% were measured. The techniques
used to restructure AIX executables are
discussed, and the performance improvements
and memory reductions measured for several
application programs are presented.




Introduction 

Today's high-performance computer memory architectures
are optimized for programs which exhibit high spatial
and/or temporal locality for both instructions and data.
Memory hierarchies have evolved in an attempt to
minimize cost and maximize performance by exploiting
this "locality of reference" program characteristic.
Similarly, design assumptions are typically made regarding
other program characteristics (such as branching behavior)
which result in processor designs optimized for those
assumed characteristics (such as branch prediction).

As long as these program assumptions hold, processor
performance is maximized. However, when a program
deviates from these assumed characteristics, the processor
architecture is inefficiently utilized, which usually leads to
reduced performance or excessive use of real memory.

While hardware design tradeoffs are made on the basis
of software-related assumptions, compilers attempt to
generate "optimum" code targeted for a specific hardware
architecture (including the memory hierarchy) on the basis
of similar program assumptions. However, compiler
optimizations are usually limited to a purely static analysis
of a program which includes speculation as to how a
program will probably execute on a given hardware
platform. Additionally, since many programs result
from binding together multiple, separately compiled (or
assembled) object modules, the compiler does not usually
have a "global view" of the final organization of the
executable image and therefore cannot perform a truly
global optimization.

TDPR effectively "closes the loop" in the optimization
process. It attempts to further optimize a program by
collecting information on the actual behavior of a program
while it is executed and using that information to reorder
and modify instructions across the entire executable
program image to optimize the use of the hardware.

Consider the following simple example of poor program
locality for a typical high-level language code sequence:

if (x == y)
{
    /* Error handler code */
}
/* Otherwise, execution continues here */

In this code sequence, the error path (taken when
variable x is equal to variable y) is usually not executed
(information which is not known at compile time). Figure 1
shows the resulting assembler code generated for a typical
code sequence of this type. The example represents a
machine with 16 instructions per instruction cache line.
Notice that although only the first four instructions are
usually executed [the instructions for the if (x == y)
statement], the remaining unexecuted instructions
(representing the error handler code) are also loaded into
the cache. Since the minimum allocatable unit of a cache
(typically a cache line) is usually much larger than a single
instruction, poor program locality results in higher miss
rates, and therefore reduced performance, due to
inefficient cache utilization. Similarly, real memory space
may be wasted on instructions which are usually not
executed but, due to their proximity to frequently executed
code, are loaded when a real page is allocated.

Figure 1 also shows the results of reordering the
instructions according to the way in which they are
executed. On the basis of information collected at run
time, the frequently executed code paths are grouped
together. The result is improved performance [due to
reduced instruction cache and TLB (translation lookaside
buffer) miss rates] and a reduction in run-time memory
requirements (due to improved utilization of real memory
pages).

Also, the conditional branch instruction has been
recoded with a different branch target address and
the opposite (reversed sense) condition code (from a
BNE Target_Address to BEQ Fall_thru_address).
This illustrates an additional opportunity to improve
performance, on the basis of actual program behavior, by
reducing inefficient use of available hardware optimizations
(which, in this case, are reduced pipeline stalls due to
incorrectly predicted-not-taken branches).

Another improvement, which results indirectly from
stringing together frequently executed code paths, is that
of reduced collisions in an N-way set-associative cache. If
more than N instructions in a highly executed code loop
map to the same cache congruence class, constant cache
misses will occur because of the thrashing which results
from these collisions. Reordering the instructions in a
program according to the actual execution path potentially
produces additional performance improvements by
reducing "conflict misses" in an N-way set-associative
cache.
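
The thrashing effect can be illustrated with a small
simulation. The sketch below is not part of the TDPR tools;
it models a hypothetical N-way set-associative instruction
cache with LRU replacement (the set count, associativity,
and replacement policy are assumptions chosen only for
illustration) and counts misses for a loop whose hot cache
lines either collide in one congruence class or are packed
contiguously, as they might be after reordering.

```python
from collections import OrderedDict

def count_misses(line_numbers, num_sets, ways, iterations):
    """Count misses for a loop that touches the given cache-line numbers.

    Hypothetical model: set index = line number mod num_sets, LRU replacement.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for _ in range(iterations):
        for line in line_numbers:
            s = sets[line % num_sets]
            if line in s:
                s.move_to_end(line)        # hit: mark as most recently used
            else:
                misses += 1
                if len(s) >= ways:
                    s.popitem(last=False)  # evict the least recently used line
                s[line] = True
    return misses

NUM_SETS, WAYS, ITERATIONS = 64, 2, 100

# Three hot lines that all map to congruence class 0 of a 2-way cache.
scattered = [0, 64, 128]
# The same three lines after reordering places them contiguously.
packed = [0, 1, 2]

scattered_misses = count_misses(scattered, NUM_SETS, WAYS, ITERATIONS)
packed_misses = count_misses(packed, NUM_SETS, WAYS, ITERATIONS)
print(scattered_misses, packed_misses)
```

With three lines competing for two ways, every access in the
scattered layout misses, while the packed layout misses only
on the first pass through the loop.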

TDPR process overview 

The process of applying trace-directed program
restructuring is illustrated in Figure 2. First, the executable
program to be restructured is run for the desired workload
(W) while an instruction address trace (or execution
profile) is captured and analyzed. The result of this
analysis is an address reorder list which represents the
"optimal" ordering of the instructions in that executable
program image for the given workload. Second, the
address reorder list and the executable program file
are used to create a new, restructured, executable by
reordering the instructions from the original program
in the sequence specified in the reorder list.

[Figure 2: New, reordered executable (optimized for workload W)]

The reordered executable resulting from applying the 
TDPR process will exhibit varying degrees of performance 
improvement and/or reduced instruction memory 
requirements when run on workload W (or similar 
workloads). 

Design and implementation of TDPR for AIX
executables

The design and implementation of trace-directed program
restructuring for AIX executable programs entails solving
two major problems: 1) managing dynamically calculated
branches (computed goto's) and 2) generating an
"optimal" address reorder list. Once these problems are
solved, the remainder of the effort revolves around the
fairly simple repositioning and accounting required to build
the reordered executable.

In this implementation, the minimum reorderable unit is
the basic block (a basic block is defined as a sequence of
instructions that has exactly one entry point and exactly
one exit point). The addresses specified for TDPR are the
addresses of the first instruction in the basic block. When
a basic block is moved while reordering an executable, all
of the instructions in the basic block are moved together.

• Managing dynamically calculated branches
The branch target or destination address of a dynamically
calculated branch (DCB) is calculated as a program runs
and is usually difficult if not impossible to determine
statically. For the POWER, POWER2™, and PowerPC 601™
processors, the DCB takes the form of a branch-to-register
instruction. In order to move instructions during TDPR,
some mechanism must be provided to eliminate
the problem of a DCB calculating and branching to the
address of an instruction that has been moved. One such
mechanism would be to attempt to recognize all possible
types of DCBs generated for some subset of all compilers
(and compiler versions) used to create the executable
programs. The problem with this approach is that it is not
fail-safe, and program functionality or correctness cannot be
guaranteed because of the possibility of unanticipated code
sequences (such as might arise with different compiler
versions or with user-written, "nonstandard" assembler
programs).

The mechanism developed to manage dynamically
calculated branches for this implementation of TDPR is
illustrated in Figure 3. The idea is to keep the original text
(instruction) section intact except for instructions that
are reordered (i.e., moved during TDPR). Reordered
instructions are appended to the end of the original
executable (in the "reordered text area") and are replaced
(in the "original text area") with branches to the new
addresses where the instructions have been moved.

For example, instructions 1, 2, and 3 in the original text
section shown in Figure 3 have been moved to locations
L1, L2, and L3 in the reordered text area, and the original
instructions in the patched original text area have been
replaced with branches (B) to locations L1, L2, and L3,
respectively. Instruction 4 and the Branch reg (branch to
register) instruction, which are not part of a frequently
executed code path in this example, are not moved.
Additionally, all traceback entries (which are embedded at
the end of each procedure for program debug) are removed
from highly executed code paths (i.e., not moved with
reordered code) but are maintained in the original text
section for debuggability. With this mechanism in place, if
an unanticipated DCB attempts to branch to the address of
a moved instruction (such as the Branch reg to location
I2), it will simply encounter the branch (B L2) to the
new location of the instruction and then branch to that
new location, thus preserving functionality.

Although this technique for managing DCBs does
maintain functionality for most programs (high-level
language and assembler alike), it can be undesirable from
the perspective of performance and memory utilization
because of the double branch sequence, resulting from
undetected DCBs, which usually forces two memory pages
to be touched. However, the vast majority of DCBs found
in AIX executables are due to 1) the C switch/case
statement (which typically generates a branch table in the
program constant area) and 2) calling a function through a
pointer (which uses a function descriptor in the program
data area). This double branch sequence can usually be
eliminated by updating the addresses of moved instructions
in the branch tables and function descriptors with the new
reordered addresses. In this implementation of TDPR,
both the branch tables and the function descriptors are
scanned for the addresses of moved instructions and are
(optionally) modified with the correct reordered addresses.
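
The table fix-up can be pictured as a simple address
substitution. The following is a minimal sketch, assuming
that a branch table (or the entry-point fields of a set of
function descriptors) is available as a plain list of code
addresses and that the moved instructions are described by
an original-to-reordered address map; the function name and
both data layouts are hypothetical and are not taken from
the prototype tools.

```python
def patch_address_table(table, moved_to):
    """Return a copy of an address table with moved targets updated.

    table    -- list of code addresses (hypothetical representation of a
                branch table or of function-descriptor entry points)
    moved_to -- dict mapping original address -> reordered address
    """
    # Entries that do not refer to moved code are left unchanged.
    return [moved_to.get(addr, addr) for addr in table]

# Hypothetical addresses: two switch/case targets were moved, one was not.
moved_to = {0x1000: 0x9000, 0x1040: 0x9010}
branch_table = [0x1000, 0x1040, 0x2000]

patched = patch_address_table(branch_table, moved_to)
print([hex(a) for a in patched])
```

Any entry the scan misses still works, because the original
address holds a branch to the new location; it simply pays
for the double branch sequence.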

Using this mechanism for managing DCBs, a branch to
the reordered text area is executed once when a program
first begins; from that point on, execution is constrained
to the optimized reordered text area. If, however, an
unanticipated DCB (i.e., one that is undetectable and/or
cannot be modified) is encountered during program
execution, the performance improvement gained by
reordering may degrade slightly, but the program will
continue to produce the expected results.

• Generating an address reorder list for TDPR 
To apply TDPR to a program, the instruction address trace
(or profile) collected during program execution must first
be analyzed to determine an "optimal" basic block
ordering which will result in the maximum speedup
(execution time improvement) and/or memory requirement
reduction. Determining the optimal ordering of the basic
blocks in a program is a challenging problem. The
approach used here (similar to that discussed in [3]) is to
attempt to identify the most frequently executed paths
through the code by building a directed flow graph (DFG)
from the address trace (or profile) collected during program
execution.

The DFG consists of a node for every basic block with
an associated count of the number of times that basic
block was executed. Additionally, each node has one or
more edges (or pointers), with associated counts, to the
node of the basic block or blocks which are executed next.
For example, Figure 4 shows the DFG generated for the
following sequential instruction address trace:

200, 800, 100, 800, 400, 200, 800, 400, 200, 800,
400, 200, 800, 400, 200, 800, 400, 200, 800, 400,
200, 600, 200, 600, 200, 700

In this simple example trace, the basic block at address
200 is executed first, followed by the basic blocks at
addresses 800, 100, 800 and so on up to the last basic
block at address 700. The basic block at address 200
was executed a total of nine times, six of which ended in
transferring control to the basic block at address 800, two
going to 600, and one to address 700. As can be seen in
the DFG, the frequently executed or "hot" code path for
this address trace is the sequence 200-800-400.
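
The node and edge counts quoted above can be checked
mechanically. The sketch below is one possible way to build
such a graph, assuming the trace is already a list of basic
block start addresses; the function name and the use of
counters keyed by address are illustrative choices, not a
description of the prototype tools.

```python
from collections import Counter, defaultdict

def build_dfg(trace):
    """Build node and edge execution counts from a basic block address trace."""
    node_counts = Counter(trace)           # times each basic block was executed
    edge_counts = defaultdict(Counter)     # edge_counts[src][dst] = transitions
    for src, dst in zip(trace, trace[1:]):
        edge_counts[src][dst] += 1
    return node_counts, edge_counts

# The example trace from the text.
trace = [200, 800, 100, 800, 400, 200, 800, 400, 200, 800,
         400, 200, 800, 400, 200, 800, 400, 200, 800, 400,
         200, 600, 200, 600, 200, 700]

node_counts, edge_counts = build_dfg(trace)
print(node_counts[200], dict(edge_counts[200]))
```

For this trace, block 200 is counted nine times, with six
transfers to 800, two to 600, and one to 700, in agreement
with the description above.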

The algorithm used in this implementation for generating
the reorder list is described as follows:

1. Build the DFG from the instruction address trace or
profile as shown in Figure 4.

2. Provide the following alternate methods for traversing
the DFG to produce the address reorder list:

a. Starting with the most frequently executed basic
block, follow the most frequently executed paths
until a cycle is detected (i.e., a previously visited
basic block). As each basic block node is visited
in the DFG, append the basic block address to
the address reorder list. When a cycle is
detected, restart the process at the next most
frequently executed address. This is the np = 0
option.

b. Same as (a), except that when a cycle is detected,
back up one node and then go visit each next
most frequently executed basic block. This is
the np = 1 option.

c. Same as (b), except that when backing up to visit
each next most frequently executed basic block,
visit only those nodes which are executed next
more than N times. This is the np = N option.
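
Options (a) and (b) can be sketched as follows. This is an
interpretation of the steps above rather than the prototype
code: the backing-up step of option (b) is modeled as a
depth-first walk that tries successors in order of
decreasing edge count, ties are broken by lower address
(an assumption), and option (c) is not covered. The function
and parameter names are hypothetical.

```python
from collections import Counter, defaultdict

def reorder_list(trace, backtrack=False):
    """Produce an address reorder list from a basic block address trace.

    backtrack=False sketches option (a), np = 0;
    backtrack=True sketches option (b), np = 1.
    """
    node_counts = Counter(trace)
    edges = defaultdict(Counter)
    for src, dst in zip(trace, trace[1:]):
        edges[src][dst] += 1

    order, visited = [], set()

    def successors(node):
        # Successor blocks, most frequently executed edge first.
        return sorted(edges[node], key=lambda d: (-edges[node][d], d))

    def walk(node):
        visited.add(node)
        order.append(node)
        candidates = successors(node)
        if not backtrack:
            candidates = candidates[:1]    # np = 0: follow only the hottest edge
        for nxt in candidates:
            if nxt not in visited:         # a visited successor closes a cycle
                walk(nxt)

    # Restart at the next most frequently executed unvisited block.
    for start in sorted(node_counts, key=lambda n: (-node_counts[n], n)):
        if start not in visited:
            walk(start)
    return order

trace = [200, 800, 100, 800, 400, 200, 800, 400, 200, 800,
         400, 200, 800, 400, 200, 800, 400, 200, 800, 400,
         200, 600, 200, 600, 200, 700]

np0 = reorder_list(trace)
np1 = reorder_list(trace, backtrack=True)
print(np0)
print(np1)
```

For the example trace, both variants place the hot path
200-800-400 first and differ only in the order of the
infrequently executed blocks 600 and 100.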

Table 1 shows the address reorder lists generated for the
DFG shown in Figure 4 using this algorithm.

While the slight differences in the reorder lists shown
may appear inconsequential, the performance differences
can be significant for large code sequences which approach
or exceed the size of the instruction cache. Selecting the
appropriate np option, however, is usually a matter of trial
and error (although the np = 0 option usually provides the
best speedup for most programs in this implementation).



[Figure 4: DFG for the example address trace (Entry and Exit nodes)]




Table 1  Reorder lists for different np options.

np = 0    np = 1    np = N = 2

200       200       200
800       800       800
400       400       400
600       100       600
100       600
700       700



• Maintaining basic block movement
The remainder of the implementation involves the
housekeeping required to accommodate the movement of
basic blocks within the program while maintaining the
expected functionality. In this implementation of TDPR,
basic blocks are moved sequentially in a single pass, as
specified in the address reorder list.

The diagram shown in Figure 5 illustrates the movement
of a basic block (BBn) from its original position in the
program to its new location (in the reordered text area). In
this example, basic block BBn branches to the basic block
at address L1, and two basic blocks (B1 and B2) both
branch to BBn. When a basic block is moved, both the
branch out of the basic block (if it exists) and all branches
into the basic block must be adjusted.

Basic block movement is managed by maintaining a dual
entry log for each basic block in the original text section.
The first entry is an address that indicates where the basic
block for this log entry has been moved. The second entry
is a pointer to a list of all basic blocks that branch to the
basic block for this log entry. Whenever a basic block is
moved, the moved_to log entry for that block is assigned
the new address of the basic block, and all basic blocks
which branch to the block to be moved (indicated by the
branches_here entry) are adjusted to branch to the new
location.
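
A minimal sketch of such a log is shown below. Only the two
entry names, moved_to and branches_here, come from the
description above; the class, the helper function, the
addresses, and the separate map of branch targets are
hypothetical simplifications (in particular, the sketch
ignores the case in which a branching block has itself
already been moved).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogEntry:
    moved_to: Optional[int] = None                           # new address, once moved
    branches_here: List[int] = field(default_factory=list)   # blocks branching to this block

def move_block(log, branch_targets, block, new_address):
    """Record a move and retarget every block that branches to `block`."""
    log[block].moved_to = new_address
    for source in log[block].branches_here:
        branch_targets[source] = new_address

# Hypothetical addresses: B1 and B2 both branch to BBn, as in Figure 5.
B1, B2, BBN = 0x100, 0x200, 0x300
log = {B1: LogEntry(), B2: LogEntry(), BBN: LogEntry(branches_here=[B1, B2])}
branch_targets = {B1: BBN, B2: BBN}

move_block(log, branch_targets, BBN, 0x9000)
print(hex(log[BBN].moved_to), {hex(k): hex(v) for k, v in branch_targets.items()})
```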

• Branch replacement
During the course of moving basic blocks while applying
TDPR, the opportunity or requirement may arise to modify
the branch that usually terminates a basic block. This
modification may come in one of the following forms:

1. Changing the sense of a conditional branch (and
modifying the branch target address) to improve
hardware branch prediction.

2. Converting to a branch sequence to handle "branch
target out of range" and "moved fall-through code"
problems. The "branch target out of range" problem
occurs if a target address is not reachable from the
address of a branch instruction (because of the size
of the branch displacement field in the instruction);
"moved fall-through code" problems occur if the code
which follows a basic block is moved elsewhere.

3. Adjusting branch target addresses due to moved basic
blocks.

4. Eliminating a branch instruction altogether.

The branch replacement algorithm in this implementation
consists of two main cases: 1) branch-to-register, and
2) branch immediate (not to register). For the
branch-to-register case, if the basic block at the fall-through address
(i.e., immediately following the basic block) will not be
moved next, an additional branch to the fall-through code
is inserted (if needed). For the conditional branch
immediate, depending upon whether the basic block at the
fall-through or target address is the next basic block in
sequence, the branch condition is adjusted (if possible and
necessary) such that the branch will be predicted correctly
most often (where the sequence of the basic blocks from
the reorder list implies the most frequently executed path).
Also, the branch target range for existing or modified
branches is examined, and unconditional "far" branches
are added if the branch target or fall-through address is out
of range.
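
The decision for the conditional branch immediate case can
be sketched as below. This is an illustration under stated
assumptions, not the prototype's algorithm: the function and
field names are hypothetical, the branch displacement is
treated as a signed byte offset of a given width (16 bits is
used here as an assumed default for a conditional branch),
and only the decision is modeled, not the instruction
encoding.

```python
def in_range(branch_addr, target_addr, displacement_bits):
    """True if the target is reachable with a signed displacement of this width."""
    offset = target_addr - branch_addr
    limit = 1 << (displacement_bits - 1)
    return -limit <= offset < limit

def plan_conditional_branch(branch_addr, taken_target, fall_through,
                            next_block, displacement_bits=16):
    """Decide how to recode the conditional branch that ends a moved block.

    next_block is the block placed immediately after this one according
    to the reorder list (i.e., the implied hot successor).
    """
    if next_block == taken_target:
        # The hot successor was the taken target: reverse the sense of the
        # branch so that the hot path becomes the fall-through path.
        invert, target = True, fall_through
    else:
        invert, target = False, taken_target
    return {
        "invert_condition": invert,
        "branch_target": target,
        "needs_far_branch": not in_range(branch_addr, target, displacement_bits),
        # If neither successor follows, the old fall-through code has moved
        # elsewhere and an extra branch to it is required.
        "needs_fall_through_branch": next_block not in (taken_target, fall_through),
    }

# Hypothetical addresses: the taken target now follows the block directly.
hot_taken = plan_conditional_branch(0x9000, 0x9004, 0x1010, next_block=0x9004)
# The fall-through block follows, but the taken target is far away.
far_target = plan_conditional_branch(0x200000, 0x1000, 0x200004, next_block=0x200004)
print(hot_taken)
print(far_target)
```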

TDPR for user-level programs 

Applying TDPR to user-level application programs 
involves the following: 

1. Reading/decoding the AIX XCOFF (eXtended Common
Object File Format) executable program image and
collecting the different sections (data, text, etc.).

2. Reordering the text section. This is done by applying
the techniques described above and appending the
reordered code to the end of the original text section.
The size of the text section specified in the XCOFF
text header is adjusted accordingly.

3. Applying any "fix-ups" to the branch tables (for
switch/case statements), to the function descriptors in
the data section (for function calls through pointers),
and to any other XCOFF sections (such as debug
information).

4. Writing out the new executable XCOFF file image of 
the reordered program. 

TDPR results 

The results measured for reordering user-level applications
are shown in the tables which follow. The RISC
System/6000® Model 530 was used for POWER 8KB
instruction cache (8KIC) measurements, and the RISC
System/6000 Model 570 was used for POWER 32KIC
measurements.

Table 2 shows the speedups measured for the 8KIC and
32KIC POWER machines and for the POWER2 and 601
machines. All speedups shown were calculated by
comparing the execution time of the original program to
that of the reordered program for the same workload on
the same machine. Two different, commercially available,
relational database management systems (RDBMS1 and
RDBMS2) were used for the TPC-A™, TPC-B™, and
TPC-C™ tests.

It is important to note that each of the programs shown
in Table 2 was reordered and tested on the exact same
workload specific to that test; results for cross-workload
measurements are presented in Table 5, shown later.

Also shown in Table 2 are the results of adjusting the
601 branch predict bit (y-bit) using the actual execution
profile data collected for these programs. The 601
processor provides a bit in the conditional branch option
field that allows software to adjust the branch prediction
algorithm used for conditional branches. TDPR was not
applied for these y-bit tests. Actual branch-taken/not-taken
percentages were calculated from the execution profile
data, and the y-bit was adjusted accordingly to improve the
success of hardware branch prediction. These data, along
with the hardware monitor results shown below, provide
an indication of the amount of speedup due only to
improved branch prediction. The 1% performance decrease
seen for the SPEC 056.ear benchmark is apparently due
to second-order cache and branch prediction effects.

The factors contributing to the 17% speedup measured
for the RDBMS1 TPC-B test (on the POWER machine) are
shown in Table 3. These measurements were taken using a
POWER hardware performance monitor which provides
exact counts for clock cycles, instructions executed,
cache and TLB misses, etc., throughout the execution of
the program. These data indicate that the application of
TDPR provides an improvement in CPI (cycles per
instruction) resulting from reduced instruction cache (IC)
and TLB miss rates, and reductions in the percentage
of conditionally issued instructions (i.e., conditional
branches) that were canceled (i.e., predicted incorrectly).

The reductions in text real memory requirements for
several user-level application programs are shown in
Table 4. The changes in memory requirements were
calculated using two different methods (shown as xx/yy in
the table). The first number (xx) represents the change in
the total number of pages required for the execution of the
program; the second number (yy) indicates the change
in the maximum simultaneous pages required during
execution. The increases shown for awk and vi are due to
missed branch table modifications (as described above),
which result in additional text memory pages touched
during execution. However, the 61% reduction for the
RDBMS1 TPC-B test represents an instruction memory
savings for this program of more than 512 KB.

Applying TDPR using the methodology described herein
does have the disadvantage of increasing the size of the
executable program file (because the reordered text
is appended to the original executable). However, in
environments where disk space is not an extremely critical
resource, trading additional disk storage requirements for
both improved performance and reduced real memory
requirements is usually desirable. The increases in
executable file sizes are also shown in Table 4.

Table 3  Factors contributing to RDBMS1 TPC-B speedup.

Parameter    Reordered    Original

CPI
IC miss      4.20%        5.90%
ITLB miss    0.150%       0.390%
Can/cond     23.0%        32.0%

Table 4  Text working set and executable size changes (%).

Text working set (xx/yy):  -25/-25, -31/-59, -28/-24, -18/-48, -43/-61
Executable size:           +16, +31, +41, +11, +19, +8, +5

Cross-workload effects

One potential problem with applying TDPR is that of
determining an appropriate workload to use while
reordering a program. If two different workloads exercise a
program in a completely different manner, finding a single
address reorder list that is optimal for both workloads is
improbable. For example, a program is reordered for
workload A, and the reordered version is then run on
workload A and results in a speedup of Sa. Similarly, a
version reordered for workload B is run on workload B
and results in a speedup of Sb. Reordering a third version
of the program for both workloads A and B together,
where the workloads use and exercise the program very
differently, and running that version separately on both
workloads usually results in speedups of less than Sa and
Sb. Also, running a reordered program on workload C,
where workload C was not in the set of workloads used to
reorder the program, typically also yields little (or possibly
negative) improvement if workload C is very different from
the other workloads.

For example, Table 5 shows the cross-workload results
for reordering both the awk and ksh executable programs.
The three reordered programs for awk are awk.heap
(reordered for a heapsort workload), awk.pts [reordered for
an awk PTS (performance test suite) workload], and
awk.comb (reordered for both the heapsort and PTS
workloads). The reordered programs for ksh are ksh.scr
(reordered for the ksh "built-in" commands workload
scr1), ksh.sum (reordered for the sequential summation
workload sum.ksh), and ksh.comb (reordered for both
scr1 and sum.ksh workloads).

Table 5  Cross-workload results.

Workload    Reordered program speedups (%)

            awk.heap    awk.pts    awk.comb
heapsort    +19                    +18
PTS         +18         +22        +18

            ksh.scr     ksh.sum    ksh.comb
scr1        +21         +3         +18
sum.ksh     +11         +45        +30

As the data of Table 5 indicate, running the awk.pts
program on the heapsort workload (i.e., a workload not used
to reorder the program) actually results in a decrease in
performance. However, running awk.heap on the PTS
workload (again, a workload not used to reorder the
program) results in an 18% speedup (slightly less than
awk.heap run on the heapsort workload). The combined
reordered awk (awk.comb) produces significant speedups for
both workloads (although awk.comb running PTS yields less
improvement than awk.pts running PTS). The ksh
cross-workload results are quite similar to the awk results, with
only +3% speedup shown for ksh.sum running the scr1
workload and still significant speedups for ksh.comb on both
workloads. However, to achieve the maximum performance
improvements (at least for ksh and awk and these simple
workloads), the program must be reordered for the exact
workload for which it is to be used.

A potential solution to the "cross-workload effect"
problem for widely varying workloads is to produce
different versions of the program which are each optimized
for specific workload types. Then, knowing what workload
type is to be run, the reordered version of the program
that is optimized for that workload type is used.

TDPR for kernel/kernel extensions and device
drivers

In addition to user-level executable programs, significant
improvements can also be achieved by applying TDPR
to AIX base operating system (kernel) code, kernel
extensions, and device drivers with the following special
considerations. Implementing TDPR on executable images
is not well suited to programs which utilize self-modifying
or otherwise position-dependent code because of the
difficulty in detecting and correcting for modifications to
code that has been moved. A form of position-dependent
code can be found in system-level software (such as the
base kernel, kernel extensions, and device drivers of AIX)
which utilizes pinned instruction memory. Pinned memory
is memory (in a virtual memory system) that is never
"paged out" (i.e., always present in real memory,
especially during interrupts and other critical times) and,
therefore, will never result in page faults when referenced.

If an area of pinned instruction memory is reordered,
the area in the reordered text section where those
instructions are moved must also be pinned. Since the
granularity provided for pinning memory is usually at least
a page, it can be quite inefficient to pin text reordered at the
basic block level. One solution would be to pin the entire
reordered text area. However, the base kernel usually has
other position-dependent code that makes dynamic extension
of the kernel more difficult than user-level code.

The solution developed and implemented here relies on
the standard practice of building the AIX kernel with
separate pinned and pageable sections. As illustrated in
Figure 6, the kernel is built with a sufficiently large "hole"
or reorder area in the pinned section; when TDPR is
applied, all reordered text is moved to that pinned reorder
area (TDPR target area). Through the use of this
technique, reordered pinned code remains pinned and
reordered pageable code becomes pinned.

Although one may argue that pinning code that was
previously pageable reduces the effectiveness of a pageable
kernel, a case can be made that reordered code, which is
frequently executed code, should be pinned (or would be
"paged-in" anyway) because of its utilization.

Debugging support

Reordering an executable program as described herein
can impose some additional requirements in the area of
program debugging. Any debugging information embedded
in the executable file that points to code which has been
reordered must be adjusted either in the executable file
(if possible) or during the debug process. Also, AIX
executables contain traceback entries at the end of
every procedure which are used, among other things, to
determine the procedure name for an instruction address if
a program crash occurs. These traceback entries are not
moved while reordering and are therefore not present in
the reordered code (but are left intact relative to the
instructions in the original text section).

Debugging a TDPR-reordered executable is possible by
utilizing a special TDPR XCOFF section created in the
reordered executable program file which provides a
cross-reference table containing the original and reordered
addresses for all moved code. Using this cross-reference
information, along with the original text area which still
has the traceback entries in place, the debugger (with
minor modifications) can function as it would with the
original program.
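
As an illustration of how a debugger might use such a table,
the sketch below maps an address in the reordered text area
back to the corresponding original address. The layout of the
cross-reference entries (original start, reordered start,
and block length) is an assumption made for the example, as
is the simplification that instruction offsets within a
moved block are unchanged; the actual format of the TDPR
XCOFF section is not described here.

```python
import bisect

def original_address(reordered_addr, xref):
    """Map an address in the reordered text area back to the original text.

    xref -- list of (original_start, reordered_start, length_in_bytes),
            one entry per moved basic block (hypothetical layout).
    Returns None if the address is not inside any moved block.
    """
    by_new = sorted(xref, key=lambda entry: entry[1])
    starts = [entry[1] for entry in by_new]
    # Find the last moved block that starts at or before the address.
    i = bisect.bisect_right(starts, reordered_addr) - 1
    if i < 0:
        return None
    orig_start, new_start, length = by_new[i]
    if reordered_addr >= new_start + length:
        return None
    return orig_start + (reordered_addr - new_start)

# Hypothetical cross-reference entries for two moved basic blocks.
xref = [(0x1000, 0x9000, 16), (0x1400, 0x9010, 8)]

inside = original_address(0x9014, xref)
outside = original_address(0x9018, xref)
print(hex(inside), outside)
```

With the original address in hand, the traceback entry that
follows the enclosing procedure in the original text section
can then be located as it would be for an unreordered
program.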

Conclusions 

The application of trace-directed program restructuring on
programs running in a hierarchical virtual memory system
has the potential to produce significant performance
enhancements and reductions in real memory requirements
for both user-level and kernel programs. By using the
prototype tools developed for this investigation,
performance improvements for AIX executable programs
of up to 73% and reductions in text real memory
requirements of up to 61% were measured. For
applications where the workloads are not critical to
program behavior, producing a single reordered executable
to realize these improvements should be feasible. In cases
where different workloads change program behavior
dramatically, providing multiple executables (each
reordered for a specific workload type) or reordering for
the most common workload may still prove beneficial.

The techniques described herein have been implemented
in the IBM AIX software product "FDPR" (feedback
directed program restructuring); preliminary results
indicate significant performance improvements for a
variety of programs.

Opportunities for additional work in this area include
the development of "optimal" algorithms for reorder-list
generation, including techniques to maintain pre-existing
compiler optimizations and direct optimization for N-way
set-associative cache collisions, multi-workload
optimizations, and data reordering.